Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Feiyang Huang, UID 306148942

Display machine information for reproducibility:

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.2 compiler_4.2.3    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.2.3       htmltools_0.5.5   rstudioapi_0.14   yaml_2.3.7       
 [9] rmarkdown_2.23    knitr_1.43        jsonlite_1.8.7    xfun_0.39        
[13] digest_0.6.32     rlang_1.1.1       evaluate_0.21    

Load necessary libraries (you can add more as needed).

library(arrow)
Warning:
  It appears that you are running R and Arrow in emulation (i.e. you're
  running an Intel version of R on a non-Intel mac). This configuration is
  not supported by arrow, you should install a native (arm64) build of R
  and use arrow with that. See https://cran.r-project.org/bin/macosx/

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.2 (2022-11-11 22:00:03 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:arrow':

    schema

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  8.000 GiB 
Freeram:   1.488 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer:

First Read the data:

labevents_tble <- arrow::open_dataset("~/mimic/hosp/labevents.parquet")
patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transfers_tble <- read_csv("~/mimic/hosp/transfers.csv.gz")
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
procedures_icd_tble <- read_csv("~/mimic/hosp/procedures_icd.csv.gz")
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
diagnoses_icd_tble <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz")
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures_tble <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses_tble <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
patients <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Filter data for the patient with subject_id 10013310

patient_id <- 10013310
patient_admissions <- admissions_tble %>%
  filter(subject_id == patient_id)

patient_transfers <- transfers_tble %>%
  filter(subject_id == patient_id)

labevents_tble <- as_tibble(labevents_tble)
patient_labevents <- labevents_tble %>%
  filter(subject_id == patient_id)

patient_procedures <- procedures_icd_tble %>%
  filter(subject_id == patient_id)

patient_diagnoses <- diagnoses_icd_tble %>%
  filter(subject_id == patient_id)

patients_info <- patients %>%
  filter(subject_id == patient_id)

Merge relevant data tables:

patient_procedures <- patient_procedures %>%
  left_join(d_icd_procedures_tble, by = "icd_code") %>%
  select(subject_id, hadm_id, icd_code, chartdate, long_title)

patient_diagnoses <- patient_diagnoses %>%
  left_join(d_icd_diagnoses_tble, by = "icd_code") %>%
  select(subject_id, hadm_id, icd_code, long_title)

Visualize the ADT history:

# Select ICU/CCU rows
icu_ccu_rows <- patient_transfers[grepl("(ICU|CCU)", patient_transfers$careunit), ]
# Make sure the data is in right format
patient_procedures$chartdate <- as.POSIXct(patient_procedures$chartdate)
patient_transfers <- patient_transfers[complete.cases(patient_transfers$careunit, 
                                                      patient_transfers$intime, 
                                                      patient_transfers$outtime), ]
patient_procedures$long_title <- as.factor(patient_procedures$long_title)
# Let the y-axis be the type of event (ADT, lab, procedure)
convert_to_text <- function(x) {
  y_labels <- c("  ","Procedure", "  "," Lab", "  ", "ADT")
  y_labels[ceiling(x * length(y_labels))]
}

# Make the legend more readable
truncate_legend <- function(x, max_chars) {
  ifelse(nchar(x) > max_chars, substr(x, 1, max_chars), x)
}

# Find the top three diagnoses
title_freq <- table(patient_diagnoses$long_title)

top_three_titles <- names(sort(title_freq, decreasing = TRUE)[1:3])

subtitle <- sprintf("\n%s\n%s\n%s", top_three_titles[1], 
                    top_three_titles[2], top_three_titles[3])
ggplot() +
  geom_segment(data = patient_transfers, aes(x = intime, xend = outtime, 
                                             color = careunit, y = 1, yend = 1, 
                                             linewidth= ifelse(
                                               transfer_id%in%icu_ccu_rows$transfer_id, 
                                               2, 1.5))) +
  geom_point(data = patient_labevents, aes(x = charttime, y = 0.6), shape = "x") +
  geom_point(data = patient_procedures, 
             aes(x = chartdate, y = 0.2, shape = long_title)) +
  scale_shape_manual(values = 0:10,labels = truncate_legend(
    levels(patient_procedures$long_title),15))+
  scale_x_datetime(labels = scales::date_format("%m-%d")) +
  scale_y_continuous(labels = convert_to_text) +
  labs(title = paste("Patient", patient_admissions$subject_id,patients_info$gender,
                     patients_info$anchor_age,
                     patient_admissions$race),
       subtitle = subtitle,
       x=("Calendar Time"), y=(" "))+
  guides(linewidth = "none", shape = guide_legend(title = "Procedure",ncol = 1), 
         color = guide_legend(ncol  = 1, title = "Care Unit"))+
  theme_minimal()+
  theme(legend.position = "bottom")

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer:

zcat < ~/mimic/icu/chartevents.csv.gz > ~/mimic/icu/chartevents.csv
icu_stays <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chartevents <- arrow::open_dataset("~/mimic/icu/chartevents.parquet")
# Select the patient
icu_stays_patient <- icu_stays %>%
  filter(subject_id == 10013310)


chartevents_patient <- chartevents %>%
  filter(subject_id == 10013310)
chartevents_patient <- as_tibble(chartevents_patient)

# Select the items that we are interested in
condition <- d_items$linksto == "chartevents" & d_items$abbreviation %in% 
  c("HR", "NBPd", "NBPs", "RR", "Temperature F")
filtered_items <- d_items$itemid[condition]
d_items_need <- d_items %>%
  filter(itemid %in% filtered_items)
# Merge the data
chartevents_patient <- chartevents_patient %>%
  filter(itemid %in% filtered_items) %>%
  left_join(d_items_need, by = "itemid") %>%
  select(subject_id, stay_id,charttime, value, valuenum, abbreviation, itemid)
# Make sure the data is in right format
chartevents_patient$charttime <- as.POSIXct(chartevents_patient$charttime)
chartevents_patient$stay_id <- as.factor(chartevents_patient$stay_id)
chartevents_patient$abbreviation <- as.factor(chartevents_patient$abbreviation)
ggplot() +
  geom_line(data = chartevents_patient, aes(x = charttime, y = valuenum, 
                                            group = interaction(stay_id, 
                                                                abbreviation), 
                                            color = interaction(stay_id, 
                                                                abbreviation))) +
  geom_point(data = chartevents_patient, aes(x = charttime, y = valuenum, 
                                             group = interaction(stay_id, 
                                                                 abbreviation), 
                                             color = interaction(stay_id, 
                                                                 abbreviation))) +
  labs(title = paste("Stay ID:", chartevents_patient$stay_id),
       x = "Time",
       y = "Value") +
  facet_grid(cols = vars(stay_id), rows = vars(abbreviation), scales = "free") +
  guides(color = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        strip.background = element_rect(fill = "lightgray")) +
  scale_x_datetime(labels = scales::date_format("%m-%d %H:%M")) +
  scale_color_manual(values = c("orange", "orange", "yellow4", "yellow4", 
                                "springgreen3",  "springgreen3", "deepskyblue",
                                "deepskyblue", "magenta2", "magenta2"))
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer:

icustays_tble <- as_tibble(read_csv("~/mimic/icu/icustays.csv.gz"))
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer: Here is the number of unique subject_id:

unique_subject_id <- icustays_tble %>%
  distinct(subject_id) %>%
  count()

unique_subject_id
# A tibble: 1 × 1
      n
  <int>
1 50920

Yes, a subject_id can have multiple ICU stays. Here is the summary of the number of ICU stays per subject_id:

icu_stays_summary <- icustays_tble %>%
  distinct(subject_id, stay_id) %>%
  group_by(subject_id) %>%
  count() %>%
  arrange(desc(n))
head(icu_stays_summary, 5)
# A tibble: 5 × 2
# Groups:   subject_id [5]
  subject_id     n
       <dbl> <int>
1   18358138    37
2   12468016    33
3   17585185    33
4   13269859    30
5   18676703    26

Here is the graph of the number of ICU stays per subject_id:

plot <- ggplot(data= icu_stays_summary) +
  geom_bar(mapping = aes(x = n)) +
  labs(title = "Number of unique subject_id",
       x = "Count ICU stays",
       y = "Count subject_id") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_x_discrete(limits = icu_stays_summary$n)
Warning: Continuous limits supplied to discrete scale.
ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
plotly_plot <- ggplotly(plot)
plotly_plot

Over 75% of the patient only have one ICU stay.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer:

admissions_tble <- as_tibble(read_csv("~/mimic/hosp/admissions.csv.gz"))
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer:

Number of admissions per patient:

admissions_summary <- admissions_tble %>%
  distinct(subject_id, hadm_id) %>%
  group_by(subject_id) %>%
  count() %>%
  arrange(desc(n))
plot <- ggplot(data= admissions_summary) +
  geom_bar(mapping = aes(x = n)) +
  labs(title = "Number of admissions per patient",
       x = "Count admissions",
       y = "Count subject_id") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

plotly_plot <- ggplotly(plot)
plotly_plot

We can see that over 100,000 patients have only one admission, about 35,000 patients have two admissions, and about 15,000 patients have three admissions. Only 15.43% of patients have more than 3 admissions.

Admission hour:

admissions_tble$admittime <- as.POSIXct(admissions_tble$admittime, format = "%Y-%m-%d %H:%M:%S")
admissions_tble$hour <- hour(admissions_tble$admittime)
hour_counts <- admissions_tble %>%
  count(hour)
plot <- ggplot(hour_counts, aes(x = hour, y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Count of Admissions by Hour",
       x = "Hour",
       y = "Count") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
plotly_plot

From the graph, we can see that the number of admissions is relatively low between 1:00 ~ 6:00, and 8:00 ~ 13:00, and relatively high between 16:00 ~ 24:00. It implies the fact that most ICU admissions occur during the night and afternoon.

Admission minute:

admissions_tble$minute <- minute(admissions_tble$admittime)
minute_counts <- admissions_tble %>%
  count(minute)
plot <- ggplot(minute_counts, aes(x = minute, y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Count of Admissions by Minute",
       x = "Minute",
       y = "Count") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
plotly_plot

Most of the admissions occur at every quarter of the hour, and have a relative avergae occurance at the rest of the minutes.

Length of hospital stay:

admissions_tble$dischtime <- as.POSIXct(admissions_tble$dischtime, 
                                        format = "%Y-%m-%d %H:%M:%S")
admissions_tble$admittime <- as.POSIXct(admissions_tble$admittime, 
                                        format = "%Y-%m-%d %H:%M:%S")
admissions_tble$los <- as.numeric(difftime(admissions_tble$dischtime, 
                                           admissions_tble$admittime, units = "days"))
plot <- ggplot(data = admissions_tble) +
  geom_histogram(mapping = aes(x = los),bins = 100) +
  labs(title = "Distribution of Length of Hospital Stay",
       x = "Length of Hospital Stay",
       y = "Count") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
plotly_plot

From the graph, we can see that the distribution of length of hospital stay is right-skewed, with most patients stay in the hospital for less than 10 days, and it’s unusual that some patients has a length of hospital stay that is negative number.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer:

patients_tble <- as_tibble(read_csv("~/mimic/hosp/patients.csv.gz"))
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

gender_counts <- patients_tble %>%
  count(gender)

ggplot(gender_counts, aes(x = "", y = n, fill = gender)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Gender Distribution",
       fill = "Gender") +
  scale_fill_manual(values = c("lightpink1", "lightblue2")) +
  geom_text(aes(label = paste0(n, " (", round((n/sum(n))*100), "%)")), 
            position = position_stack(vjust = 0.5)) +
  theme_void()

From the graph, we can see that the gender distribution is relatively balanced, with 53% of patients are female and 47% of patients are male.

plot <- ggplot(data = patients_tble) +
  geom_histogram(mapping = aes(x = anchor_age, fill = gender), bins = 100, alpha = 0.8, position = "identity") +
  labs(title = "Distribution of anchor_age by Gender",
       x = "anchor_age",
       y = "Count") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
plotly_plot

From the graph, we can see that the distribution of anchor_age is right-skewed, with most patients are between 20 and 25 years old and around 13% of the patients are older than 75 years old.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer:

labevents <- arrow::open_dataset("~/mimic/hosp/labevents.parquet")
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz")
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sub_need <- as.vector(unique(icu_stays$subject_id))
labevent_need <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)
# Filter the colomns and rows that we are interested in
labevents <- labevents %>%
  filter(itemid %in% labevent_need)%>%
  filter(subject_id %in% sub_need)%>%
  select(subject_id, hadm_id, itemid, charttime, storetime, valuenum)
labevents_tble <- as_tibble(labevents)
# Merge the labevents_tble with icu_stays by subject_id and 
# make sure the storetime is before intime
merged_data <- labevents %>%
  merge(icu_stays, by = 'subject_id') %>%
  select(subject_id,itemid, charttime, storetime, intime, outtime,
         valuenum, stay_id, los) %>%
  filter(storetime < intime)

Since some multiple values of the same item with the same storetime, we use the one with the latest charttime. If still multiple values of the same item with the same storeitem and charttime, we can use either value or average.

subset_data <- merged_data %>%
  arrange(desc(storetime), desc(charttime)) %>%
  collect() %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_max(storetime) %>%
  select(-charttime) %>%
  summarize(avg_value = mean(valuenum, na.rm = TRUE), .groups = "drop") %>%
  collect()
subset_data_lab <- subset_data %>%
  inner_join(d_labitems, by = "itemid") %>%
  select(subject_id, stay_id, avg_value, label) %>%
  collect()
final_labevents_tble <- subset_data_lab %>%
  pivot_wider(names_from = label, values_from = avg_value)
final_labevents_tble
# A tibble: 68,467 × 10
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Potassium Sodium
        <int>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: Hematocrit <dbl>, `White Blood Cells` <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer:

chartevents <- arrow::open_dataset("~/mimic/icu/chartevents.parquet")
vital_need <- c(220045, 220179, 220180, 223761, 220210)
# Filter the colomns and rows that we are interested in
chartevents_filter <- chartevents %>%
  filter(itemid %in% vital_need)%>%
  filter(subject_id %in% sub_need)%>%
  select(subject_id, stay_id, itemid, charttime, storetime, valuenum)
chartevents_tble <- as_tibble(chartevents_filter)
# Merge the chartevents_tble with icu_stays by subject_id and stay_id
merged_data_ <- chartevents_tble %>%
  left_join(icu_stays, by = c('subject_id','stay_id'))
# Make sure the charttime is after intime and before outtime
merged_data_ch <- merged_data_ %>%
  select(subject_id,itemid, charttime, storetime, intime, outtime,
         valuenum, stay_id) %>%
  filter(charttime >= intime) %>%
  filter(charttime <= outtime)
subset_data_ch <- merged_data_ch %>%
  arrange(subject_id, stay_id, itemid, charttime) %>%
  collect() %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_min(charttime) %>%
  summarize(avg_value = mean(valuenum, na.rm = TRUE), .groups = "drop") %>%
  collect()
subset_data_ch_lab <- subset_data_ch %>%
  inner_join(d_items, by = "itemid") %>%
  select(subject_id, stay_id, avg_value, label) %>%
  collect()
final_chartevents_tble <- subset_data_ch_lab %>%
  pivot_wider(names_from = label, values_from = avg_value)
final_chartevents_tble
# A tibble: 73,164 × 7
   subject_id stay_id `Heart Rate` Non Invasive Blood P…¹ Non Invasive Blood P…²
        <dbl>   <dbl>        <dbl>                  <dbl>                  <dbl>
 1   10000032  3.96e7           91                     84                     48
 2   10000980  3.98e7           77                    150                     77
 3   10001217  3.46e7           96                    167                     95
 4   10001217  3.71e7           86                    151                     90
 5   10001725  3.12e7           55                     73                     56
 6   10001884  3.75e7           38                    180                     12
 7   10002013  3.91e7           80                    104                     70
 8   10002155  3.11e7           94                    118                     51
 9   10002155  3.24e7           98                    109                     65
10   10002155  3.37e7           68                    126                     61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​`Non Invasive Blood Pressure systolic`,
#   ²​`Non Invasive Blood Pressure diastolic`
# ℹ 2 more variables: `Respiratory Rate` <dbl>, `Temperature Fahrenheit` <dbl>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer:

Since the admission_tble has been modified in previous question, we need to re-import the data.

admissions_tble <- as_tibble(read_csv("~/mimic/hosp/admissions.csv.gz"))
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mimic_icu_cohort <- icustays_tble %>%
  left_join(admissions_tble, by =c("subject_id", "hadm_id")) %>%
  left_join(patients_tble, by = c("subject_id")) %>%
  left_join(final_labevents_tble, by = c("subject_id", "stay_id")) %>%
  left_join(final_chartevents_tble, by = c("subject_id", "stay_id"))
mimic_icu_cohort$age_intime <- as.numeric(format(mimic_icu_cohort$intime, "%Y"))-mimic_icu_cohort$anchor_year+mimic_icu_cohort$anchor_age
mimic_icu_cohort <- mimic_icu_cohort %>%
  filter(age_intime >= 18)
mimic_icu_cohort
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, gender <chr>, anchor_age <dbl>, …

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit

Answer:

Length of ICU stay los vs demographic variables:

mimic_icu_cohort$race <- as.factor(mimic_icu_cohort$race)
summary_by_race <- tapply(mimic_icu_cohort$los, mimic_icu_cohort$race, summary)
summary_by_race_df <- as.data.frame(do.call(rbind, summary_by_race))
summary_by_race_df
                                                 Min.   1st Qu.   Median
AMERICAN INDIAN/ALASKA NATIVE             0.087870370 1.1100000 2.093432
ASIAN                                     0.063125000 1.0786227 1.925845
ASIAN - ASIAN INDIAN                      0.160324074 1.0683681 1.945602
ASIAN - CHINESE                           0.020601852 1.0096383 1.864167
ASIAN - KOREAN                            0.211354167 1.3458623 2.077384
ASIAN - SOUTH EAST ASIAN                  0.078368056 1.0926736 1.822500
BLACK/AFRICAN                             0.034502315 1.0620023 2.021470
BLACK/AFRICAN AMERICAN                    0.003460648 1.0152836 1.846933
BLACK/CAPE VERDEAN                        0.029583333 0.9443229 1.704902
BLACK/CARIBBEAN ISLAND                    0.044502315 1.0774103 1.945365
HISPANIC OR LATINO                        0.088495370 0.9816146 1.629225
HISPANIC/LATINO - CENTRAL AMERICAN        0.472175926 1.1429167 1.974664
HISPANIC/LATINO - COLUMBIAN               0.598356481 1.0245313 1.926285
HISPANIC/LATINO - CUBAN                   0.398946759 1.2471672 1.852355
HISPANIC/LATINO - DOMINICAN               0.049201389 1.1246123 2.085556
HISPANIC/LATINO - GUATEMALAN              0.002418981 1.0781105 1.801082
HISPANIC/LATINO - HONDURAN                0.009722222 1.0828241 1.801076
HISPANIC/LATINO - MEXICAN                 0.111701389 1.0455700 1.577731
HISPANIC/LATINO - PUERTO RICAN            0.044687500 1.0940278 1.906505
HISPANIC/LATINO - SALVADORAN              0.143402778 0.9516262 1.713767
MULTIPLE RACE/ETHNICITY                   0.232615741 1.1608015 1.896516
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER 0.303125000 0.9189670 1.757031
OTHER                                     0.005983796 1.0887789 1.915284
PATIENT DECLINED TO ANSWER                0.073206019 1.0910648 1.767859
PORTUGUESE                                0.054618056 1.1350058 2.136418
SOUTH AMERICAN                            0.582766204 1.1767130 2.230266
UNABLE TO OBTAIN                          0.004421296 1.1605961 2.093351
UNKNOWN                                   0.018136574 1.1821528 2.177662
WHITE                                     0.001250000 1.0868287 1.919560
WHITE - BRAZILIAN                         0.027187500 1.0914641 1.807141
WHITE - EASTERN EUROPEAN                  0.129918981 1.1400984 1.714705
WHITE - OTHER EUROPEAN                    0.002268519 1.1111169 1.910775
WHITE - RUSSIAN                           0.010462963 1.0398872 1.829554
                                              Mean  3rd Qu.       Max.
AMERICAN INDIAN/ALASKA NATIVE             4.455923 5.116892  49.905231
ASIAN                                     3.485844 3.777025  59.467303
ASIAN - ASIAN INDIAN                      4.199515 3.873767  66.267338
ASIAN - CHINESE                           3.358139 3.290955  77.740706
ASIAN - KOREAN                            4.230412 4.193021  39.414225
ASIAN - SOUTH EAST ASIAN                  3.039018 2.994259  76.086458
BLACK/AFRICAN                             3.816520 4.106296  54.843808
BLACK/AFRICAN AMERICAN                    3.275962 3.447841  95.838218
BLACK/CAPE VERDEAN                        3.215837 3.478137  51.660475
BLACK/CARIBBEAN ISLAND                    3.610511 3.529436  38.248229
HISPANIC OR LATINO                        2.732006 2.955434  47.050139
HISPANIC/LATINO - CENTRAL AMERICAN        3.407362 3.723605  18.713877
HISPANIC/LATINO - COLUMBIAN               3.014349 3.331999  20.780845
HISPANIC/LATINO - CUBAN                   2.948106 3.942682  19.101505
HISPANIC/LATINO - DOMINICAN               3.855573 4.077216  55.985312
HISPANIC/LATINO - GUATEMALAN              2.955333 2.863116  38.990278
HISPANIC/LATINO - HONDURAN                2.700771 2.978044  15.011400
HISPANIC/LATINO - MEXICAN                 2.648043 2.673032  18.518067
HISPANIC/LATINO - PUERTO RICAN            3.507275 3.706293  44.176065
HISPANIC/LATINO - SALVADORAN              3.417620 3.312778  58.258067
MULTIPLE RACE/ETHNICITY                   2.773564 3.045660  25.822303
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER 3.317625 3.332922  53.484248
OTHER                                     3.548833 3.969803  55.720220
PATIENT DECLINED TO ANSWER                2.956491 3.005266  29.318380
PORTUGUESE                                4.256530 4.135917  61.873889
SOUTH AMERICAN                            2.759191 4.017118   8.560949
UNABLE TO OBTAIN                          4.223386 4.127538  86.848403
UNKNOWN                                   4.201132 4.706528  99.638449
WHITE                                     3.368996 3.642280 110.232280
WHITE - BRAZILIAN                         3.413554 3.004803  50.051910
WHITE - EASTERN EUROPEAN                  2.882399 2.870663  27.274074
WHITE - OTHER EUROPEAN                    3.664809 3.639184 103.499005
WHITE - RUSSIAN                           2.903664 3.362731  32.809676
ggplot(mimic_icu_cohort, aes(x = str_wrap(race, width = 15), y = los)) +
  geom_boxplot(trim = FALSE,  outlier.size = 0.5) +
  labs(title = "Distribution of ICU Length of Stay by Race",
       x = "Race",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 35, vjust = 1, hjust=1, size = 4))
Warning in geom_boxplot(trim = FALSE, outlier.size = 0.5): Ignoring unknown
parameters: `trim`

From the graph, we can see that the distribution of ICU length of stay of different race are very similar, which are very right-skewed, with most of the patients (75%) stay in the ICU for less than 5 days.

mimic_icu_cohort$insurance <- as.factor(mimic_icu_cohort$insurance)
summary_by_insurance <- tapply(mimic_icu_cohort$los, mimic_icu_cohort$insurance, summary)
summary_by_insurance_df <- as.data.frame(do.call(rbind, summary_by_insurance))
summary_by_insurance_df
                Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Medicaid 0.006250000 1.018889 1.811481 3.463981 3.475891  86.8484
Medicare 0.001446759 1.116418 2.006181 3.483231 3.824057 103.4990
Other    0.001250000 1.069361 1.865891 3.420036 3.620098 110.2323
ggplot(mimic_icu_cohort, aes(x = insurance, y = los)) +
  geom_boxplot(trim = FALSE, outlier.size = 0.5) +
  labs(title = "Distribution of ICU Length of Stay by Insurance",
       x = "Insurance",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, size = 6))
Warning in geom_boxplot(trim = FALSE, outlier.size = 0.5): Ignoring unknown
parameters: `trim`

From the graph, we can see that the distribution of ICU length of stay of different insurance are very similar, which are very right-skewed, with most of the patients (75%) stay in the ICU for about 3 days.

mimic_icu_cohort$marital_status <- as.factor(mimic_icu_cohort$marital_status)
summary_by_marital_status <- tapply(mimic_icu_cohort$los, mimic_icu_cohort$marital_status, summary)
summary_by_marital_status_df <- as.data.frame(do.call(rbind, summary_by_marital_status))
summary_by_marital_status_df
                Min.  1st Qu.   Median     Mean  3rd Qu.      Max.
DIVORCED 0.007500000 1.083330 1.902338 3.385371 3.731606  91.01376
MARRIED  0.001250000 1.103747 1.935098 3.455288 3.664135 110.23228
SINGLE   0.002268519 1.039543 1.872078 3.401897 3.633429  95.83822
WIDOWED  0.007326389 1.087248 1.934479 3.125789 3.499381  86.31249
ggplot(mimic_icu_cohort, aes(x = marital_status, y = los)) +
  geom_boxplot(trim = FALSE, outlier.size = 0.5) +
  labs(title = "Distribution of ICU Length of Stay by Marital Status",
       x = "Marital Status",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, size = 8))
Warning in geom_boxplot(trim = FALSE, outlier.size = 0.5): Ignoring unknown
parameters: `trim`

From the graph, we can see that the distribution of ICU length of stay of different marital status are very similar, which are very right-skewed, with most of the patients (75%) stay in the ICU for about 3 days.

mimic_icu_cohort$gender <- as.factor(mimic_icu_cohort$gender)
summary_by_gender <- tapply(mimic_icu_cohort$los, mimic_icu_cohort$gender, summary)
summary_by_gender_df <- as.data.frame(do.call(rbind, summary_by_gender))
summary_by_gender_df
         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
F 0.001446759 1.052002 1.911053 3.342715 3.649201 110.2323
M 0.001250000 1.108356 1.940694 3.538524 3.763284 103.4990
ggplot(mimic_icu_cohort, aes(x = gender, y = los)) +
  geom_boxplot(trim = FALSE, outlier.size = 0.5) +
  labs(title = "Distribution of ICU Length of Stay by Gender",
       x = "Gender",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust=1, size = 8))
Warning in geom_boxplot(trim = FALSE, outlier.size = 0.5): Ignoring unknown
parameters: `trim`

From the graph, we can see that the distribution of ICU length of stay of different gender are very similar, which are very right-skewed, with most of the patients (75%) stay in the ICU for about 3 days.

plot <- ggplot(mimic_icu_cohort, aes(x = age_intime, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Age",
       x = "Age",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus age is relatively balanced, with most of the patients (75%) stay in the ICU for less than 30 days, very a few patients stay in the ICU for more than 60 days.

length of ICU stay los vs the last available lab measurements before ICU stay:

plot <- ggplot(mimic_icu_cohort, aes(x = Creatinine, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Creatinine",
       x = "Creatinine",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 5770 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that the most patient have creatinine level less than 5, and the length of ICU stay is relatively balanced. Very few patient have creatinine level more than 20, and the length of ICU stay is relatively shorter for those patients.

plot <- ggplot(mimic_icu_cohort, aes(x = Potassium, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Potassium",
       x = "Potassium",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 8901 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that for patient have potassium level less than 3.5 or greater than 6.5 always have a relatively shorter length of ICU stay, usually less than 30. However, for patient have potassium level between 3.5 and 6.5, the length of ICU stay is relatively right-skewed, with a long tail of staying more than 60 days.

plot <- ggplot(mimic_icu_cohort, aes(x = Sodium, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Sodium",
       x = "Sodium",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 8872 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that for patient have sodium level less than 125 or greater than 150 always have a relatively shorter length of ICU stay, usually less than 30. However, for patient have sodium level between 130 and 150, the length of ICU stay is relatively right-skewed, with a long tail of staying more than 60 days.

plot <- ggplot(mimic_icu_cohort, aes(x = Chloride, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Chloride",
       x = "Chloride",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 8883 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that for patient have chloride level less than 90 or greater than 110 always have a relatively shorter length of ICU stay, usually less than 30. However, for patient have chloride level between 90 and 110, the length of ICU stay is relatively right-skewed, with a long tail of staying more than 60 days.

plot <- ggplot(mimic_icu_cohort, aes(x = Bicarbonate, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Bicarbonate",
       x = "Bicarbonate",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 9050 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that for patient have bicarbonate level less than 15 or greater than 35 always have a relatively shorter length of ICU stay, usually less than 30. However, for patient have bicarbonate level between 15 and 35, the length of ICU stay is relatively right-skewed, with a long tail of staying more than 60 days.

plot <- ggplot(mimic_icu_cohort, aes(x = Hematocrit, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Hematocrit",
       x = "Hematocrit",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 5017 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that for patient have hematocrit level less than 20 or greater than 50 always have a relatively shorter length of ICU stay, usually less than 30. However, for patient have hematocrit level between 20 and 50, the length of ICU stay is relatively right-skewed, with a long tail of staying more than 60 days.

plot <- ggplot(mimic_icu_cohort, aes(x = `White Blood Cells`, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by White Blood Cell Count",
       x = "White Blood Cell Count",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 5094 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that most patient have white blood cell count less than 50 and have a relatively right-skewed length of ICU stay. However, for patient have white blood cell count between 50 and 100, the length of ICU stay is relatively balanced, with most of the patients stay in the ICU for less than 30 days. Only a few patient have white blood cell count more than 400, and the length of ICU stay is relatively shorter for those patients.

plot <- ggplot(mimic_icu_cohort, aes(x = Glucose, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Glucose",
       x = "Glucose",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 9099 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph we can see that most patient have glucose level less than 200 and have a relatively right-skewed length of ICU stay. However, for patient have glucose level between 200 and 1500, the length of ICU stay is relatively balanced, with most of the patients stay in the ICU for less than 30 days. Only a few patient have glucose level more than 1500, and the length of ICU stay is relatively shorter for those patients.

Length of ICU stay los vs the average vital measurements within the first hour of ICU stay:

plot <- ggplot(mimic_icu_cohort, aes(x = `Heart Rate`, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Heart Rate",
       x = "Heart Rate",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 18 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus heart rate is relatively right-skewed. Most patient have Heart Rate around 75 and stay for about 3 days, and it’s unusual that some patients have a heart rate that is more than 200.

plot<- ggplot(mimic_icu_cohort, aes(x = `Non Invasive Blood Pressure systolic`, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Systolic Blood Pressure",
       x = "Systolic Blood Pressure",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 979 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus systolic blood pressure is relatively right-skewed. Most patient have Systolic Blood Pressure around 120 and stay for about 3 days, and it’s unusual that some patients have a Systolic Blood Pressure that is more than 12500

plot<- ggplot(mimic_icu_cohort, aes(x = `Non Invasive Blood Pressure diastolic`, 
                                    y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Diastolic Blood Pressure",
       x = "Diastolic Blood Pressure",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 983 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus diastolic blood pressure is relatively right-skewed. Most patient have Diastolic Blood Pressure around 80 and stay for about 3 days, and it’s unusual that some patients have a Diastolic Blood Pressure that is more than 60000.

plot <- ggplot(mimic_icu_cohort, aes(x = `Temperature Fahrenheit`, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Temperature",
       x = "Temperature",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 1362 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus temperature is relatively right-skewed Most patient have temperature around 98 and stay for about 3 days, and it’s unusual that some patients have a temperature that is more than 1000.

plot <- ggplot(mimic_icu_cohort, aes(x = `Respiratory Rate`, y = los)) +
  geom_hex() +
  labs(title = "Distribution of ICU Length of Stay by Respiratory Rate",
       x = "Respiratory Rate",
       y = "Length of Stay (days)") +
  theme_minimal()

plotly_plot <- ggplotly(plot)
Warning: Removed 98 rows containing non-finite values (`stat_binhex()`).
plotly_plot

From the graph, we can see that the distribution of ICU length of stay versus respiratory rate is relatively right-skewed. Most patient have respiratory rate around 20 and stay for about 3 days, and it’s unusual that some patients have a respiratory rate that is more than 75.

Length of ICU stay los vs first ICU unit:

mimic_icu_cohort$first_careunit <- as.factor(mimic_icu_cohort$first_careunit)
summary_by_first_careunit <- tapply(mimic_icu_cohort$los, mimic_icu_cohort$first_careunit, summary)
summary_by_first_careunit_df <- as.data.frame(do.call(rbind, summary_by_first_careunit))
summary_by_first_careunit_df
                                                        Min.   1st Qu.   Median
Cardiac Vascular Intensive Care Unit (CVICU)     0.003518519 1.2134664 1.988229
Coronary Care Unit (CCU)                         0.001250000 1.0891319 2.014063
Medical Intensive Care Unit (MICU)               0.002268519 0.9804572 1.829288
Medical/Surgical Intensive Care Unit (MICU/SICU) 0.001446759 1.0151620 1.808368
Neuro Intermediate                               0.003460648 1.1225405 2.335116
Neuro Stepdown                                   0.057002315 0.8905787 1.686250
Neuro Surgical Intensive Care Unit (Neuro SICU)  0.021574074 1.7941406 3.651233
Surgical Intensive Care Unit (SICU)              0.007453704 1.0798958 1.974583
Trauma SICU (TSICU)                              0.002175926 1.0318953 1.939421
                                                     Mean  3rd Qu.      Max.
Cardiac Vascular Intensive Care Unit (CVICU)     3.290002 3.295211 103.49900
Coronary Care Unit (CCU)                         3.192368 3.725990  66.58424
Medical Intensive Care Unit (MICU)               3.263419 3.522914 110.23228
Medical/Surgical Intensive Care Unit (MICU/SICU) 3.084816 3.193600  67.87067
Neuro Intermediate                               3.420690 4.246418  43.51039
Neuro Stepdown                                   2.590663 3.163588  29.10352
Neuro Surgical Intensive Care Unit (Neuro SICU)  6.299043 7.743111  59.56295
Surgical Intensive Care Unit (SICU)              3.837683 4.026551 101.72624
Trauma SICU (TSICU)                              3.833085 3.983388  99.63845
ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot(trim = FALSE, outlier.size = 0.5) +
  labs(title = "Distribution of ICU Length of Stay by First Care Unit",
       x = "First Care Unit",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 10, vjust = 1, hjust=1, size = 6))
Warning in geom_boxplot(trim = FALSE, outlier.size = 0.5): Ignoring unknown
parameters: `trim`

From the graph, we can see that the distribution of ICU length of stay of different first care unit are very similar, which are very right-skewed, with most of the patients (75%) stay in the ICU for about 3 days.